Introduction

Column

Introduction to the Study

What Makes a Movie Successful

For this final presentation I have chosen to look at a data set containing information on various movies. The data set covers films released from 1980 through 2020, giving us about 40 years' worth of movies. The movies are not limited to one genre; a variety of genres is represented. The data set also records each movie's writer, director, main actor/actress, budget, country of origin, runtime, and rating. With all of these variables, my goal was to find trends in the keys to success for films. For the sake of this study, success is defined solely by ratings and gross revenue.

My final project attempts to answer some of the questions below:

  • Is there a correlation between budget and gross revenue?

  • Do the ratings of a movie (good or bad) indicate the gross revenue (low or high)?

  • Does the run-time of the movie affect the rating and revenue?

  • Do certain genres of films have better ratings or a larger gross revenue than others?

To do list

  • add trend line to scatter plots
  • make a map of average revenue per location (try merging the data set with a source that has longitude and latitude along with country name; the data set has no long and lat, so a map won’t work without them. Hint: left merge)
  • average revenue by genre boxplot

Table of Data

Column

Variable Explanation

  • name: This is the title of the film/movie.
  • rating: The rating of the movie (Approved, G, PG, PG-13, R, X, Unrated, TV-PG, TV-14, TV-MA, and NC-17).
  • genre: Genre of the movie (Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, and Western).
  • year: The year of the release.
  • released: The release date (YYYY-MM-DD).
  • score: IMDb user rating.
  • votes: Number of user votes.
  • director: Name of the director.
  • writer: Name of the writer.
  • star: The name of the main actor/actress.
  • country: Origin of the movie.
  • budget: The budget of the movie in USD (Some movies don’t have this, so it appears as 0).
  • gross: Gross revenue of the movie in USD.
  • company: The name of the production company.
  • runtime: The length of the movie in minutes.

Rating-Revenue Corr.

Column

Rating vs Revenue (1980-1989)

Rating vs Revenue (1990-1999)

Rating vs Revenue (2000-2009)

Rating vs Revenue (2010-2020)

Column

Analysis

Rating vs. Revenue

From the scatterplots data,

Budget-Revenue Corr.

Column

Budget vs Revenue (1980-1989)

Budget vs Revenue (1990-1999)

Budget vs Revenue (2000-2009)

Budget vs Revenue (2010-2020)

Column

Analysis

Budget vs. Revenue

From this data,

Runtime-Score Corr.

Column

Runtime vs. Score

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   55.0    95.0   104.0   107.3   116.0   366.0       4 

Column

Analysis

Runtime vs. Score

There is little correlation between a movie's score and its runtime. The middle half of the movies run between 95 and 116 minutes (1 hour 35 minutes to 1 hour 56 minutes), with a mean runtime of about 107 minutes (1 hour 47 minutes). Movies within this range receive a wide variety of scores: some score a little over two, while others with the same runtime score over eight. However, the shape of the chart suggests that movies that run a little longer generally score slightly above the mean.

Genre and Rating

Column

Average Genre Revenue

Average Rating Revenue

---
title: "Unboxing the Box Office"
author: "Jonah Mergler"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: minty5
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
pacman::p_load(tidyverse, knitr, DT, plotly, maps)
df <- read_csv("movies.csv")
```

<style>
.chart-title { /* chart_title */
  font-size: 18px;
  }
body{ /* Normal */
  font-size: 16px;
  }
</style>


Introduction 
===

Column {.tabset data-width=650}
---

### Introduction to the Study

**What Makes a Movie Successful**

For this final presentation I have chosen to look at a data set containing information on various movies. The data set covers films released from 1980 through 2020, giving us about 40 years' worth of movies. The movies are not limited to one genre; a variety of genres is represented. The data set also records each movie's writer, director, main actor/actress, budget, country of origin, runtime, and rating. With all of these variables, my goal was to find trends in the keys to success for films. For the sake of this study, success is defined solely by ratings and gross revenue.

My final project attempts to answer some of the questions below:

- Is there a correlation between budget and gross revenue?

- Do the ratings of a movie (good or bad) indicate the gross revenue (low or high)?

- Does the run-time of the movie affect the rating and revenue?

- Do certain genres of films have better ratings or a larger gross revenue than others?

**To do list**

- add trend line to scatter plots
- make a map of average revenue per location (try merging the data set with a source that has longitude and latitude along with country name; the data set has no long and lat, so a map won't work without them. Hint: left merge)
- average revenue by genre boxplot
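
The left merge mentioned in the to-do list could look roughly like the sketch below. It is only an illustration: `movies_toy` and `coords` are hypothetical stand-ins (the coordinates are approximate and typed in by hand), and a real lookup table would have to come from another source whose country names match those in `movies.csv`.

```r
library(dplyr)

# Hypothetical toy data standing in for movies.csv
movies_toy <- tibble(
  country = c("United States", "United States", "France", "Japan"),
  gross   = c(1.0e8, 5.0e7, 2.0e7, 3.0e7)
)

# Hypothetical coordinate lookup with approximate values
coords <- tibble(
  country = c("United States", "France"),
  lat     = c(39.8, 46.2),
  long    = c(-98.6, 2.2)
)

# Average revenue per country, then a left join so every country
# from the movie data is kept even when no coordinates are found
avg_by_country <- movies_toy %>%
  group_by(country) %>%
  summarise(avg_gross = mean(gross, na.rm = TRUE), .groups = "drop") %>%
  left_join(coords, by = "country")

avg_by_country
```

Countries missing from the lookup (Japan in this toy example) end up with `NA` coordinates, which makes mismatched country names easy to spot before mapping.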

### Table of Data

```{r Q0}
DT::datatable(df)
```

Column {data-width=350}
---

### Variable Explanation

- **name**: This is the title of the film/movie.
- **rating**: The rating of the movie (Approved, G, PG, PG-13, R, X, Unrated, TV-PG, TV-14, TV-MA, and NC-17).
- **genre**: Genre of the movie (Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, and Western).
- **year**: The year of the release.
- **released**: The release date (YYYY-MM-DD).
- **score**: IMDb user rating.
- **votes**: Number of user votes.
- **director**: Name of the director.
- **writer**: Name of the writer.
- **star**: The name of the main actor/actress.
- **country**: Origin of the movie.
- **budget**: The budget of the movie in USD (Some movies don't have this, so it appears as 0).
- **gross**: Gross revenue of the movie in USD.
- **company**: The name of the production company.
- **runtime**: The length of the movie in minutes.
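
Since a missing budget is described above as appearing as 0, budget summaries and plots can be skewed unless those rows are treated as missing. A minimal sketch, assuming that convention holds; the toy tibble `movies_toy` and its values are hypothetical and not taken from `movies.csv`:

```r
library(dplyr)

# Hypothetical toy data; a budget of 0 stands for "not reported"
movies_toy <- tibble(
  name   = c("Film A", "Film B", "Film C"),
  budget = c(1.5e7, 0, 4.0e7),
  gross  = c(3.0e7, 2.0e6, 9.0e7)
)

# Recode 0 as NA so unreported budgets drop out of summaries
movies_clean <- movies_toy %>%
  mutate(budget = na_if(budget, 0))

# Mean budget over the films that actually report one
mean(movies_clean$budget, na.rm = TRUE)
```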


Rating-Revenue Corr.
===

Column {.tabset data-width=700}
---

### Rating vs Revenue (1980-1989)

``` {r Q1}
df <- df %>% 
  mutate(year = as.numeric(year))

df <- mutate(df, decade = case_when(
  year >= 1980 & year <=1989 ~ "1980's",
  year >= 1990 & year <=1999 ~ "1990's",
  year >= 2000 & year <=2009 ~ "2000's",
  year >= 2010 & year <=2020 ~ "2010's (including 2020)"))

df1980 <- df[df$decade == "1980's",]

ggplot(df1980, aes(x = score, y = gross)) +
  geom_point(aes(text = paste0(name, " \n",
                              "Revenue: ", gross, "\n",
                              "Score: ", score)), 
             color = "#C8A2C8", shape = 21) +
  geom_smooth(se = F, color = "#3eb489") +
  labs(title = "Gross Revenue by Rating (1980's)",
       x = "Score (out of ten)",
       y = "Gross Revenue ($)") -> p1

ggplotly(p1, tooltip = "text")
```

### Rating vs Revenue (1990-1999)

``` {r Q2}
df1990 <- df[df$decade == "1990's",]

ggplot(df1990, aes(x = score, y = gross)) +
  geom_point(aes(text = paste0(name, " \n",
                              "Revenue: ", gross, "\n",
                              "Score: ", score)),
             color = "#C8A2C8", shape = 21) +
  geom_smooth(se = F, color = "#3EB489") +
  labs(title = "Gross Revenue by Rating (1990's)",
       x = "Score (out of ten)",
       y = "Gross Revenue ($)") -> p2

ggplotly(p2, tooltip = "text")
```

### Rating vs Revenue (2000-2009)

``` {r Q3}
df2000 <- df[df$decade == "2000's",]

ggplot(df2000, aes(x = score, y = gross)) +
  geom_point(aes(text = paste0(name, " \n",
                              "Revenue: ", gross, "\n",
                              "Score: ", score)),
             color = "#C8A2C8", shape = 21) +
  geom_smooth(se = F, color = "#3EB489") +
  labs(title = "Gross Revenue by Rating (2000's)",
       x = "Score (out of ten)",
       y = "Gross Revenue ($)") -> p3

ggplotly(p3, tooltip = "text")
```

### Rating vs Revenue (2010-2020)

``` {r Q4}
df2010 <- df[df$decade == "2010's (including 2020)",]

ggplot(df2010, aes(x = score, y = gross)) +
  geom_point(aes(text = paste0(name, " \n",
                              "Revenue: ", gross, "\n",
                              "Score: ", score)),
             color = "#C8A2C8", shape = 21)+
  geom_smooth(se = F, color = "#3EB489") +
  labs(title = "Gross Revenue by Rating (2010-2020)",
       x = "Score (out of ten)",
       y = "Gross Revenue ($)") -> p4

ggplotly(p4, tooltip = "text")
```

Column {data-width=300}
---

### Analysis

**Rating vs. Revenue**

From the scatterplots data,  


Budget-Revenue Corr.
===

Column {.tabset data-width=700}
---

### Budget vs Revenue (1980-1989)

``` {r Q5}
ggplot(df1980, aes(x = budget, y = gross)) +
  geom_point(aes(text = paste0(name, " \n",
                              "Revenue: ", gross, "\n",
                              "Budget: ", budget, "\n",
                              "Score: ", score)),
             color = "#613613", shape = 1) +
  geom_smooth(se = F, color = "#3EB489") +
  labs(title = "Gross Revenue by Budget (1980's)",
       x = "Budget ($)",
       y = "Gross Revenue ($)") -> p5

ggplotly(p5, tooltip = "text")
```

### Budget vs Revenue (1990-1999)

``` {r Q6}
ggplot(df1990, aes(x = budget, y = gross)) +
  geom_point(aes(text = paste0(name, " \n",
                              "Revenue: ", gross, "\n",
                              "Budget: ", budget, "\n",
                              "Score: ", score)),
             color = "#613613", shape = 1) +
  geom_smooth(se = F, color = "#3EB489") +
  labs(title = "Gross Revenue by Budget (1990's)",
       x = "Budget ($)",
       y = "Gross Revenue ($)") -> p6

ggplotly(p6, tooltip = "text")
```

### Budget vs Revenue (2000-2009)

``` {r Q7}
ggplot(df2000, aes(x = budget, y = gross)) +
  geom_point(aes(text = paste0(name, " \n",
                              "Revenue: ", gross, "\n",
                              "Budget: ", budget, "\n",
                              "Score: ", score)),
             color = "#613613", shape = 1) +
  geom_smooth(se = F, color = "#3EB489") +
  labs(title = "Gross Revenue by Budget (2000's)",
       x = "Budget ($)",
       y = "Gross Revenue ($)") -> p7

ggplotly(p7, tooltip = "text")
```

### Budget vs Revenue (2010-2020)

``` {r Q8}
ggplot(df2010, aes(x = budget, y = gross)) +
  geom_point(aes(text = paste0(name, " \n",
                              "Revenue: ", gross, "\n",
                              "Budget: ", budget, "\n",
                              "Score: ", score)),
             color = "#613613", shape = 1) +
  geom_smooth(se = F, color = "#3EB489") +
  labs(title = "Gross Revenue by Budget (2010's)",
       x = "Budget ($)",
       y = "Gross Revenue ($)") -> p8

ggplotly(p8, tooltip = "text")
```

Column {data-width=300}
---

### Analysis

**Budget vs. Revenue**

From this data,

Runtime-Score Corr.
===

Column {data-width=700}
---

### Runtime vs. Score

``` {r Q9}
ggplot(df, aes(x = score, y = runtime)) +
  geom_point(aes(text = paste0(name, " \n",
                              "Runtime: ", runtime, "\n",
                              "Score: ", score, "\n",
                              "Revenue: ", gross)),
             color = "#4a6274", shape = 5) +
  geom_smooth(se = F, color = "#3EB489") +
  labs(title = "Runtime vs Score",
       x = "Score (out of ten)", 
       y = "Runtime (minutes)") -> p9

ggplotly(p9, tooltip = "text")

summary(df$runtime)
```

Column {data-width=300}
---

### Analysis

**Runtime vs. Score**

There is little correlation between a movie's score and its runtime. The middle half of the movies run between 95 and 116 minutes (1 hour 35 minutes to 1 hour 56 minutes), with a mean runtime of about 107 minutes (1 hour 47 minutes). Movies within this range receive a wide variety of scores: some score a little over two, while others with the same runtime score over eight. However, the shape of the chart suggests that movies that run a little longer generally score slightly above the mean.
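
One way to put a number on "little correlation" is a Pearson coefficient. The sketch below uses a hypothetical toy data frame rather than `movies.csv`, so its result says nothing about the real data; with the dashboard's data the same call would be `cor(df$score, df$runtime, use = "complete.obs")`.

```r
# Hypothetical toy values, not taken from movies.csv
toy <- data.frame(
  score   = c(5.1, 6.4, 7.2, 6.0, 8.1, NA),
  runtime = c(95, 104, 130, 98, 116, 110)
)

# Pearson correlation, dropping rows where either value is missing
r <- cor(toy$score, toy$runtime, use = "complete.obs")
r
```

Values of `r` near 0 indicate a weak linear relationship, while values near 1 or -1 indicate a strong one. The `use = "complete.obs"` argument matters here because the runtime column contains missing values.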

Genre and Rating
===

Column {.tabset data-width=700}
---

### Average Genre Revenue 

``` {r Q10}
df$gross <- as.numeric(df$gross)

df <- mutate(df, new_genre = case_when(
  genre == "Action" ~ "Action",
  genre == "Adventure" ~ "Adventure",
  genre == "Animation" ~ "Animation",
  genre == "Biography" ~ "Biography",
  genre == "Comedy" ~ "Comedy",
  genre == "Crime" ~ "Crime",
  genre == "Drama" ~ "Drama",
  genre == "Horror" ~ "Horror",
  genre == "Family" | genre == "Fantasy" | genre == "History"
  | genre == "Music" | genre == "Musical" | genre == "Mystery"
  | genre == "Romance" | genre == "Sci-Fi" | genre == "Sport"
  | genre == "Thriller" | genre == "Western" ~ "Other")) 

ggplot(df, aes(x = new_genre, y = gross)) +
  geom_boxplot(color = "black", fill = "#f88379") +
  # Zoom to $0-$100M without dropping higher-grossing films from the box statistics
  coord_cartesian(ylim = c(0, 1e8)) +
  labs(title = "Revenue by Genre", x = "Genre", y = "Gross Revenue") -> b

ggplotly(b)
```

### Average Rating Revenue

``` {r Q11}
df <- mutate(df, new_rating = case_when(
  rating == "G" ~ "G",
  rating == "PG" ~ "PG",
  rating == "PG-13" ~ "PG-13",
  rating == "R" ~ "R",
  rating == "Approved" | rating == "NC-17" | rating == "Not Rated" |
    rating == "TV-14" | rating == "TV-MA" | rating == "TV-PG" |
    rating == "Unrated" | rating == "X" ~ "Other"))

ggplot(df, aes(x = new_rating, y = gross)) +
  geom_boxplot() +
  # Zoom to $0-$100M without dropping higher-grossing films from the box statistics
  coord_cartesian(ylim = c(0, 1e8))
```